Introduction 




A computer program is said to learn from experience E with 
respect to some task T and some performance measure P if its 
performance on T, as measured by P, improves with experience E. 
Suppose we feed a learning algorithm a lot of historical weather 
data, and have it learn to predict weather. In this setting, what is T?




The probability of it correctly predicting a future date's weather. 

The process of the algorithm examining a large amount of historical weather data. 
None of these. 

The weather prediction task. 


The amount of rain that falls in a day is usually measured in
either millimeters (mm) or inches. Suppose you use a learning
algorithm to predict how much rain will fall tomorrow.

Would you treat this as a classification or a regression problem? 


Regression

Classification


Suppose you are working on stock market prediction. You would like to predict whether or 
not a certain company will win a patent infringement lawsuit (by training on data of 
companies that had to defend against similar lawsuits). Would you treat this as a 
classification or a regression problem? 


Classification

Regression

Some of the problems below are best addressed using a supervised
learning algorithm, and the others with an unsupervised
learning algorithm. Which of the following would you apply

supervised learning to? (Select all that apply.) In each case, assume some appropriate 
dataset is available for your algorithm to learn from. 

Have a computer examine an audio clip of a piece of music, and classify whether or not
there are vocals (i.e., a human voice singing) in that audio clip, or if it is a clip of only musical
instruments (and no vocals).

Given genetic (DNA) data from a person, predict the odds of him/her developing
diabetes over the next 10 years.

Given a large dataset of medical records from patients suffering from heart disease, try 
to learn whether there might be different clusters of such patients for which we might tailor 
separate treatments. 


Given data on how 1000 medical patients respond to an experimental drug (such as 
effectiveness of the treatment, side effects, etc.), discover whether there are different 

categories or "types" of patients in terms of how they respond to the drug, and if so what 
these categories are. 




Which of these is a reasonable definition of machine learning?

Machine learning means learning from labeled data.

Machine learning is the science of programming computers.

Machine learning is the field of study that gives computers the ability to learn without
being explicitly programmed.

Machine learning is the field of allowing robots to act intelligently.

Linear Regression With One Variable

Consider the problem of predicting how well a student does in their second
year of college/university, given how well they did in their first year.

Specifically, let x be equal to the number of "A" grades (including A-, A
and A+ grades) that a student receives in their first year of college
(freshman year). We would like to predict the value of y, which we define
as the number of "A" grades they get in their second year (sophomore year).
Questions 1 through 4 will use the following training set of a small
sample of different students' performances. Here each row is one training
example. Recall that in linear regression, our hypothesis is
$h_\theta(x) = \theta_0 + \theta_1 x$, and we use $m$
to denote the number of training examples.

For the training set given above, what is the value of $m$? In the box

below, please enter your answer (which should be a number between 0 and 10). 


Many substances that can burn (such as gasoline and alcohol) have a chemical
structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist
wants to understand how the number of carbon atoms in a molecule affects how much
energy is released when that molecule combusts (meaning that it is burned). The chemist
obtains the dataset below. In the column on the right, "kJ/mol" is the unit measuring the
amount of energy released.


Name of molecule   Number of carbon atoms in molecule (x)   Heat release when burned (kJ/mol) (y)
methane            1                                        -890
ethene             2                                        -1411
ethane             2                                        -1560
propane            3                                        -2220
cyclopropane       3                                        -2091
butane             4                                        -2878
pentane            5                                        -3537
benzene            6                                        -3268
cyclohexane        6                                        -3920
hexane             6                                        -4163
octane             8                                        -5471
naphthalene        10                                       -5157


You would like to use linear regression ($h_\theta(x) = \theta_0 + \theta_1 x$) to estimate the amount of energy
released (y) as a function of the number of carbon atoms (x). Which of the following do you
think will be the values you obtain for $\theta_0$ and $\theta_1$? You should be able to select the right
answer without actually implementing linear regression.

$\theta_0 = -569.6$, $\theta_1 = 530.9$

$\theta_0 = -1780.0$, $\theta_1 = 530.9$

$\theta_0 = -1780.0$, $\theta_1 = -530.9$

$\theta_0 = -569.6$, $\theta_1 = -530.9$
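
The question is meant to be answered by reasoning alone (energy release becomes more negative as x grows, so the slope must be negative, and the intercept should be moderate). As a cross-check only, here is a minimal sketch that fits the twelve rows of the table with the closed-form least-squares solution for one feature. The helper name `fit_line` is an assumption made for this illustration.

```python
# Closed-form least-squares fit of y = theta0 + theta1 * x for a single feature.
# Data: (number of carbon atoms, heat release in kJ/mol), copied from the table.
data = [
    (1, -890), (2, -1411), (2, -1560), (3, -2220), (3, -2091), (4, -2878),
    (5, -3537), (6, -3268), (6, -3920), (6, -4163), (8, -5471), (10, -5157),
]


def fit_line(points):
    """Return (theta0, theta1) minimizing the squared error on the points."""
    m = len(points)
    mean_x = sum(x for x, _ in points) / m
    mean_y = sum(y for _, y in points) / m
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    theta1 = sxy / sxx                  # slope
    theta0 = mean_y - theta1 * mean_x   # intercept
    return theta0, theta1


theta0, theta1 = fit_line(data)
print(f"theta0 = {theta0:.1f}, theta1 = {theta1:.1f}")
```

On this data the fit lands close to $\theta_0 = -569.6$, $\theta_1 = -530.9$, which agrees with the sign argument.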

Suppose we set $\theta_0 = -1$, $\theta_1 = 0.5$. What is $h_\theta(4)$?

$h_\theta(4) = \theta_0 + \theta_1 \cdot 4 = -1 + 0.5 \cdot 4 = 1$

Let $f$ be some function so that $f(\theta_0, \theta_1)$ outputs a number. For this problem,
$f$ is some arbitrary/unknown smooth function (not necessarily the
cost function of linear regression, so $f$ may have local optima).
Suppose we use gradient descent to try to minimize $f(\theta_0, \theta_1)$
as a function of $\theta_0$ and $\theta_1$. Which of the
following statements are true? (Check all that apply.)

If the learning rate is too small, then gradient descent may take a very long
time to converge.

Even if the learning rate $\alpha$ is very large, every iteration of
gradient descent will decrease the value of $f(\theta_0, \theta_1)$.

If $\theta_0$ and $\theta_1$ are initialized so that $\theta_0 = \theta_1$, then by symmetry (because we do
simultaneous updates to the two parameters), after one iteration of gradient descent, we will
still have $\theta_0 = \theta_1$.

If $\theta_0$ and $\theta_1$ are initialized at
a local minimum, then one iteration will not change their values.

If $\theta_0$ and $\theta_1$ are initialized at
the global minimum, then one iteration will not change their values.

No matter how $\theta_0$ and $\theta_1$ are initialized, so long
as $\alpha$ is sufficiently small, we can safely expect gradient descent to converge
to the same solution.

If the first few iterations of gradient descent cause $f(\theta_0, \theta_1)$ to
increase rather than decrease, then the most likely cause is that we have set the
learning rate $\alpha$ to too large a value.

Setting the learning rate $\alpha$ to be very small is not harmful, and can
only speed up the convergence of gradient descent.
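
The statements above about the learning rate can be tried out numerically. The sketch below runs gradient descent on a made-up convex bowl (the function `f`, its gradient, and the three values of alpha are all assumptions chosen for illustration, not part of the quiz). It shows a very small rate crawling, a moderate rate converging, a large rate making the cost grow, and a start at the minimum staying put.

```python
def f(t0, t1):
    # Illustrative smooth function with its minimum at (1, -2).
    return (t0 - 1) ** 2 + 2 * (t1 + 2) ** 2


def grad(t0, t1):
    # Partial derivatives of f with respect to t0 and t1.
    return 2 * (t0 - 1), 4 * (t1 + 2)


def gradient_descent(alpha, steps, t0=0.0, t1=0.0):
    """Run gradient descent and return the final point and the cost history."""
    history = [f(t0, t1)]
    for _ in range(steps):
        g0, g1 = grad(t0, t1)
        # Simultaneous update: both gradients are computed before either change.
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1
        history.append(f(t0, t1))
    return (t0, t1), history


_, small = gradient_descent(alpha=0.001, steps=20)  # converges very slowly
_, good = gradient_descent(alpha=0.1, steps=20)     # converges quickly
_, large = gradient_descent(alpha=0.6, steps=20)    # overshoots, cost grows
at_min, _ = gradient_descent(alpha=0.1, steps=1, t0=1.0, t1=-2.0)

print(f"alpha=0.001: {small[0]:.2f} -> {small[-1]:.2f}")
print(f"alpha=0.1:   {good[0]:.2f} -> {good[-1]:.6f}")
print(f"alpha=0.6:   {large[0]:.2f} -> {large[-1]:.2e}")
print("one step from the minimum:", at_min)
```

Because the gradient is zero at a minimum, the update leaves the parameters unchanged there regardless of alpha.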


Suppose that for some linear regression problem (say, predicting housing prices as in
the lecture), we have some training set, and for our training set we managed to find
some $\theta_0$, $\theta_1$ such that $J(\theta_0, \theta_1) = 0$. Which
of the statements below must then be true? (Check all that apply.)

Gradient descent is likely to get stuck at a local minimum and fail to find the global
minimum.

Our training set can be fit perfectly by a straight line,
i.e., all of our training examples lie perfectly on some straight line.

For these values of $\theta_0$ and $\theta_1$ that satisfy $J(\theta_0, \theta_1) = 0$,
we have that $h_\theta(x^{(i)}) = y^{(i)}$ for every training example $(x^{(i)}, y^{(i)})$.

We can perfectly predict the value of y even for new examples that we have not yet
seen (e.g., we can perfectly predict prices of even new houses that we have not yet seen).

For this to be true, we must have $\theta_0 = 0$ and $\theta_1 = 0$
so that $h_\theta(x) = 0$.

Linear Regression With Multiple Variables

Suppose m=4 students have taken some class, and the class had a midterm exam and
a final exam. You have collected a dataset of their scores on the two exams, which is as
follows:


midterm exam   (midterm exam)^2   final exam
89             7921               96
72             5184               74
94             8836               87
69             4761               78


You'd like to use polynomial regression to predict a student's final exam score from their
midterm exam score. Concretely, suppose you want to fit a model of the
form $h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$, where $x_1$ is the midterm score and $x_2$ is (midterm score)$^2$.
Further, you plan to use both feature scaling (dividing by the "max-min", or range, of a
feature) and mean normalization.

(Hint: midterm = 89, final = 96 is training example 1.)

What is the normalized feature ...?
Please enter your answer in the text box below. If applicable, please provide at least two
digits after the decimal place.

0.52
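
The feature index asked for in the question did not survive in the text, so the sketch below simply computes every normalized value for both features from the table, using mean normalization followed by division by the range. The helper name `normalize` is an assumption made for this illustration.

```python
midterm = [89, 72, 94, 69]  # midterm scores from the table, examples 1 to 4


def normalize(values):
    """Subtract the mean, then divide by the range (max - min)."""
    mean = sum(values) / len(values)
    value_range = max(values) - min(values)
    return [(v - mean) / value_range for v in values]


x1 = normalize(midterm)                    # midterm score
x2 = normalize([v ** 2 for v in midterm])  # (midterm score)^2

for i, (a, b) in enumerate(zip(x1, x2), start=1):
    print(f"example {i}: x1 = {a:.2f}, x2 = {b:.2f}")
```

Under these assumptions the recorded answer 0.52 matches $x_1$ for the third training example, $(94 - 81) / 25$.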

You run gradient descent for 15 iterations
with $\alpha = 0.3$ and compute $J(\theta)$ after each
iteration. You find that the value of $J(\theta)$ increases over
time. Based on this, which of the following conclusions seems
most plausible?

Rather than use the current value of $\alpha$, it'd be more promising to try a smaller value
of $\alpha$ (say $\alpha = 0.1$).

Rather than use the current value of $\alpha$, it'd be more promising to try a larger value
of $\alpha$ (say $\alpha = 1.0$).

$\alpha = 0.3$ is an effective choice of learning rate.


Suppose you have m=28 training examples with n=4 features (excluding the additional
all-ones feature for the intercept term, which you should add). The normal equation
is $\theta = (X^T X)^{-1} X^T y$. For the given values of m and n, what are the dimensions of $\theta$, $X$, and $y$ in
this equation?

$X$ is 28x5, $y$ is 28x5, $\theta$ is 5x5

$X$ is 28x4, $y$ is 28x1, $\theta$ is 4x4

$X$ is 28x5, $y$ is 28x1, $\theta$ is 5x1

$X$ is 28x4, $y$ is 28x1, $\theta$ is 4x1
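
To make the shapes concrete, here is a small pure-Python sketch of the normal equation with m=28 and n=4. The data is synthetic and the helpers `transpose`, `matmul`, `solve`, and `shape` are assumptions written for this illustration. The sketch solves $(X^T X)\theta = X^T y$ by elimination instead of forming the inverse explicitly, which gives the same $\theta$ when $X^T X$ is invertible.

```python
import random


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a @ x = b (a square, b a column) by Gauss-Jordan elimination."""
    n = len(a)
    aug = [row[:] + rhs[:] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [[aug[i][n] / aug[i][i]] for i in range(n)]


def shape(a):
    return (len(a), len(a[0]))


m, n = 28, 4
rng = random.Random(0)  # synthetic data, purely illustrative
true_theta = [[2.0], [-1.0], [0.5], [3.0], [1.5]]

# Design matrix: a leading column of ones plus n features per example.
X = [[1.0] + [rng.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
y = matmul(X, true_theta)

Xt = transpose(X)
theta = solve(matmul(Xt, X), matmul(Xt, y))

print(shape(X), shape(y), shape(theta))
```

With the all-ones column added, $X$ has n+1 = 5 columns, so $\theta$ has 5 rows.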

Suppose you have a dataset with m=1000000 examples and n=200000 features for
each example. You want to use multivariate linear regression to fit the parameters $\theta$ to our
data. Should you prefer gradient descent or the normal equation?

The normal equation, since gradient descent might be unable to find the optimal $\theta$.

Gradient descent, since $(X^T X)^{-1}$ will be very slow to compute in the normal equation.

Gradient descent, since it will always converge to the optimal $\theta$.

The normal equation, since it provides an efficient way to directly find the solution.

Which of the following are reasons for using feature scaling?

It speeds up gradient descent by making each iteration of gradient descent less
expensive to compute.

It speeds up gradient descent by making it require fewer iterations to get to a good
solution.

It is necessary to prevent the normal equation from getting stuck in local optima.

It prevents the matrix $X^T X$ (used in the normal equation) from being non-invertible
(singular/degenerate).

Consider the following training set of m=4 training examples:

[Table of training examples not reproduced.]

Consider the linear regression model $h_\theta(x) = \theta_0 + \theta_1 x$. What are the values of $\theta_0$ and $\theta_1$ that
you would expect to obtain upon running gradient descent on this model? (Linear regression
will be able to fit this data perfectly.)

$\theta_0 = 1$, $\theta_1 = 1$

$\theta_0 = 1$, $\theta_1 = 0.5$

$\theta_0 = 0.5$, $\theta_1 = 0$

$\theta_0 = 0$, $\theta_1 = 0.5$

$\theta_0 = 0.5$, $\theta_1 = 0.5$

Suppose we set $\theta_0 = -1$, $\theta_1 = 2$. What is $h_\theta(6)$?

$h_\theta(6) = \theta_0 + \theta_1 \cdot 6 = -1 + 2 \cdot 6 = 11$




In the given figure, the cost function $J(\theta_0, \theta_1)$ has been plotted against $\theta_0$ and $\theta_1$, as
shown in 'Plot 2'. The contour plot for the same cost function is given in 'Plot 1'. Based on the
figure, choose the correct options (check all that apply).

[Figure: Plots for Cost Function]

If we start from point B, gradient descent with a well-chosen learning rate will eventually
help us reach at or near point A, as the value of cost function $J(\theta_0, \theta_1)$ is minimum at A.

Point P (the global minimum of Plot 2) corresponds to point C of Plot 1.

If we start from point B, gradient descent with a well-chosen learning rate will eventually
help us reach at or near point A, as the value of cost function $J(\theta_0, \theta_1)$ is maximum at point A.

If we start from point B, gradient descent with a well-chosen learning rate will eventually
help us reach at or near point C, as the value of cost function $J(\theta_0, \theta_1)$ is minimum at point C.

Point P (the global minimum of Plot 2) corresponds to point A of Plot 1.








Logistic Regression 


Suppose that you have trained a logistic regression classifier, and it outputs on a new example $x$ a
prediction $h_\theta(x) = 0.2$. This means (check all that apply):

Our estimate for $P(y=1 \mid x; \theta)$ is 0.2.

Our estimate for $P(y=0 \mid x; \theta)$ is 0.2.

Our estimate for $P(y=0 \mid x; \theta)$ is 0.8.

Our estimate for $P(y=1 \mid x; \theta)$ is 0.8.



Suppose you have the following training set, and fit a logistic regression
classifier $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$.

Which of the following are true? Check all that apply.

At the optimal value of $\theta$ (e.g., found by fminunc), we will have $J(\theta) \ge 0$.

Adding polynomial features (e.g., instead
using $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2)$) would increase $J(\theta)$ because we are now
summing over more terms.

If we train gradient descent for enough iterations, for some examples $x^{(i)}$ in the training set it is
possible to obtain $h_\theta(x^{(i)}) > 1$.

The positive and negative examples cannot be separated using a straight line. So,
gradient descent will fail to converge.

Because the positive and negative examples cannot be separated using a straight line,
linear regression will perform as well as logistic regression on this data.

$J(\theta)$ will be a convex function, so gradient descent should converge to the global
minimum.
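
For logistic regression, the gradient is given by $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$. Which of
these is a correct gradient descent update for logistic regression with a learning rate of $\alpha$? Check all
that apply.

$\theta := \theta - \alpha \frac{1}{m} \sum_{i=1}^{m} (\theta^T x - y^{(i)}) x^{(i)}$.

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x^{(i)}$ (simultaneously update for all $j$).

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ (simultaneously update for all $j$).

$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( \frac{1}{1 + e^{-\theta^T x^{(i)}}} - y^{(i)} \right) x_j^{(i)}$ (simultaneously update for all $j$).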
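
As an illustration of the update built from the gradient formula in the question, the sketch below applies $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_i (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)}$ with $h_\theta(x) = g(\theta^T x)$ to a tiny made-up training set. The data, the learning rate, and the function names are assumptions chosen for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def h(theta, x):
    """Logistic hypothesis g(theta^T x); x already includes the leading 1."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))


def cost(theta, X, y):
    """Logistic regression cost J(theta), averaged over the m examples."""
    m = len(X)
    total = 0.0
    for xi, yi in zip(X, y):
        p = h(theta, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -total / m


def gradient_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j."""
    m = len(X)
    errors = [h(theta, xi) - yi for xi, yi in zip(X, y)]
    grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m
            for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grad)]


# Tiny made-up training set: each row is [1, x1], with label y.
X = [[1, 0.5], [1, 1.5], [1, 2.0], [1, 3.0], [1, 3.5], [1, 4.5]]
y = [0, 0, 1, 0, 1, 1]

theta = [0.0, 0.0]
costs = [cost(theta, X, y)]
for _ in range(200):
    theta = gradient_step(theta, X, y, alpha=0.1)
    costs.append(cost(theta, X, y))

print(f"J went from {costs[0]:.4f} to {costs[-1]:.4f}, theta = {theta}")
```

All gradients are computed from the old `theta` before any component changes, which is what "simultaneously update for all $j$" means.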

Which of the following statements are true? Check all that apply.

The sigmoid function $g(z) = \frac{1}{1 + e^{-z}}$ is never greater than one (>1).

Linear regression always works well for classification if you classify by using a threshold on the
prediction made by linear regression.

The cost function $J(\theta)$ for logistic regression trained with $m \ge 1$ examples is always greater than
or equal to zero.

For logistic regression, sometimes gradient descent will converge to a local minimum (and fail to
find the global minimum). This is the reason we prefer more advanced optimization algorithms such
as fminunc (conjugate gradient/BFGS/L-BFGS/etc).

Since we train one classifier when there are two classes, we train two classifiers when
there are three classes (and we do one-vs-all classification).

The one-vs-all technique allows you to use logistic regression for problems in which
each $y^{(i)}$ comes from a fixed, discrete set of values.





Suppose you train a logistic classifier $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$. Suppose $\theta_0 = -6$, $\theta_1 = 1$, $\theta_2 = 0$.
Which of the following figures represents the decision boundary found by your classifier?

[Figures not reproduced.]

Suppose you train a logistic classifier $h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$. Suppose $\theta_0 = -6$, $\theta_1 = 0$, $\theta_2 = 1$.
Which of the following figures represents the decision boundary found by your classifier?
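
Since the figures are missing, a quick numeric check can stand in for them. The sketch below assumes the two parameter settings are $(-6, 1, 0)$ and $(-6, 0, 1)$, which is a best reading of the damaged text, and predicts $y=1$ whenever $g(\theta^T x) \ge 0.5$, that is, whenever $\theta^T x \ge 0$. The test points are made up.

```python
import math


def predict(theta, x1, x2):
    """Return 1 when g(theta0 + theta1*x1 + theta2*x2) >= 0.5, else 0."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    probability = 1.0 / (1.0 + math.exp(-z))
    return 1 if probability >= 0.5 else 0


theta_a = [-6, 1, 0]  # -6 + x1 >= 0: boundary is the vertical line x1 = 6
theta_b = [-6, 0, 1]  # -6 + x2 >= 0: boundary is the horizontal line x2 = 6

for point in [(2, 2), (7, 2), (2, 7), (7, 7)]:
    print(point, predict(theta_a, *point), predict(theta_b, *point))
```

Under these assumptions the first classifier predicts 1 to the right of $x_1 = 6$ and the second predicts 1 above $x_2 = 6$.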

Regularization

1. You are training a classification model with logistic regression. Which of the following
statements are true? Check all that apply.

Adding a new feature to the model always results in equal or better performance on examples not
in the training set.

Introducing regularization to the model always results in equal or better performance on
examples not in the training set.

Introducing regularization to the model always results in equal or better performance on the
training set.

Adding many new features to the model makes it more likely to overfit the training set.

2. Suppose you ran logistic regression twice, once with $\lambda = 0$, and once with $\lambda = 1$. One of the times,
you got parameters $\theta = [23.4 \;\; 37.9]$, and the other time you got
$\theta = [1.03 \;\; 0.28]$. However, you forgot which value of
$\lambda$ corresponds to which value of $\theta$. Which one do you
think corresponds to $\lambda = 1$?

$\theta = [1.03 \;\; 0.28]$

$\theta = [23.4 \;\; 37.9]$
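
The intuition behind question 2 is that the regularization term penalizes large parameters, so the run with $\lambda = 1$ should end with the smaller $\theta$. A minimal sketch on made-up, linearly separable data is given below. The data, learning rate, iteration count, and the choice to leave $\theta_0$ unpenalized are assumptions of this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(X, y, lam, alpha=0.1, steps=5000):
    """Gradient descent on the regularized logistic cost (theta_0 not penalized)."""
    m = len(X)
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        errors = [sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
                  for xi, yi in zip(X, y)]
        new_theta = []
        for j, t in enumerate(theta):
            g = sum(e * xi[j] for e, xi in zip(errors, X)) / m
            if j > 0:
                g += lam / m * t  # derivative of (lam / 2m) * theta_j^2
            new_theta.append(t - alpha * g)
        theta = new_theta
    return theta


def norm(theta):
    """Euclidean length of the penalized parameters theta_1, theta_2, ..."""
    return math.sqrt(sum(t * t for t in theta[1:]))


# Made-up data: each row is [1, x1, x2].
X = [[1, 0.5, 1.0], [1, 1.0, 0.5], [1, 1.5, 1.5],
     [1, 3.0, 2.5], [1, 3.5, 3.5], [1, 4.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]

theta_unreg = train(X, y, lam=0.0)
theta_reg = train(X, y, lam=1.0)
print(f"lambda=0: |theta| = {norm(theta_unreg):.2f}")
print(f"lambda=1: |theta| = {norm(theta_reg):.2f}")
```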



3. Which of the following statements about regularization are true? Check all that apply.

Using too large a value of $\lambda$ can cause your hypothesis to underfit the data.

Using a very large value of $\lambda$ cannot hurt the performance of your hypothesis; the only reason we
do not set $\lambda$ to be too large is to avoid numerical problems.

Because regularization causes $J(\theta)$ to no longer be convex, gradient descent may not always
converge to the global minimum (when $\lambda > 0$, and when using an appropriate learning rate $\alpha$).

Because logistic regression outputs values $0 < h_\theta(x) < 1$, its range of output values can only be
"shrunk" slightly by regularization anyway, so regularization is generally not helpful for it.

Consider a classification problem. Adding regularization may cause your classifier to
incorrectly classify some training examples (which it had correctly classified when not using
regularization, i.e. when $\lambda = 0$).


4. In which one of the following figures do you think the hypothesis has overfit the training set?

[Figures not reproduced.]

5. In which one of the following figures do you think the hypothesis has
underfit the training set?

[Figures not reproduced.]

https://www.coursera.org/learn/machine-learning/exam/lehkt/regularization 

Unsupervised Learning

1. For which of the following tasks might K-means clustering be a suitable algorithm?
Select all that apply.

Given historical weather records, predict if tomorrow's weather will be sunny or rainy.

From the user usage patterns on a website, figure out what different groups of users
exist.

Given many emails, you want to determine if they are Spam or Non-Spam emails.

Given a set of news articles from many different news websites, find out what are the
main topics covered.

Given a database of information about your users, automatically group them into
different market segments.

Given sales data from a large number of products in a supermarket, figure out which
products tend to form coherent groups (say are frequently purchased together) and thus
should be put on the same shelf.

Given historical weather records, predict the amount of rainfall tomorrow (this would be a
real-valued output).

Given sales data from a large number of products in a supermarket, estimate future
sales for each of these products.




2. Suppose we have three cluster centroids $\mu_1 = [1, 2]$, $\mu_2 = [-3, 0]$ and $\mu_3 = [4, 2]$.
Furthermore, we have a training example $x^{(i)} = [1, -2]$. After a cluster assignment step, what
will $c^{(i)}$ be?

$c^{(i)}$ is not assigned

$c^{(i)} = 3$

$c^{(i)} = 2$

$c^{(i)} = 1$

2. Suppose we have three cluster centroids $\mu_1 = [1, 2]$, $\mu_2 = [-3, 0]$ and $\mu_3 = [4, 2]$.
Furthermore, we have a training example $x^{(i)} = [3, 1]$. After a cluster assignment step, what
will $c^{(i)}$ be?

$c^{(i)}$ is not assigned

$c^{(i)} = 3$

$c^{(i)} = 2$

$c^{(i)} = 1$
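
The cluster assignment step sets $c^{(i)}$ to the index of the centroid closest to $x^{(i)}$. A minimal sketch for the second variant of the question ($x^{(i)} = [3, 1]$) follows. The function name `assign_cluster` is an assumption, and indices are 1-based to match the notation $\mu_1, \mu_2, \mu_3$.

```python
def assign_cluster(x, centroids):
    """Return the 1-based index of the centroid closest to x.

    Squared Euclidean distance is used; it has the same minimizer as the
    plain Euclidean distance.
    """
    distances = [sum((a - b) ** 2 for a, b in zip(x, mu)) for mu in centroids]
    return distances.index(min(distances)) + 1


centroids = [[1, 2], [-3, 0], [4, 2]]  # mu_1, mu_2, mu_3
print(assign_cluster([3, 1], centroids))
```

The squared distances from $[3, 1]$ are 5, 37 and 2, so the example is assigned to the third centroid.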


3. K-means is an iterative algorithm, and two of the following steps are repeatedly carried
out in its inner-loop. Which two?

Move each cluster centroid $\mu_k$, by setting it to be equal to the closest training example $x^{(i)}$.

The cluster centroid assignment step, where each cluster centroid $\mu_i$ is assigned (by
setting $c^{(i)}$) to the closest training example $x^{(i)}$.

The cluster assignment step, where the parameters $c^{(i)}$ are updated.

Move the cluster centroids, where the centroids $\mu_k$ are updated.

Test on the cross-validation set.

Randomly initialize the cluster centroids.

4. Suppose you have an unlabeled dataset $\{x^{(1)}, \ldots, x^{(m)}\}$. You run K-means with 50
different random initializations, and obtain 50 different clusterings of the
data. What is the recommended way for choosing which one of
these 50 clusterings to use?

Compute the distortion function $J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$ and pick the one that minimizes
this.

Plot the data and the cluster centroids, and pick the clustering that gives the most
"coherent" cluster centroids.

Manually examine the clusterings, and pick the best one.

Use the elbow method.

The only way to do so is if we also have labels $y^{(i)}$ for our data.

Always pick the final (50th) clustering found, since by that time it is more likely to have
converged to a good solution.

For each of the clusterings, compute $\frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2$, and pick the one that
minimizes this.

The answer is ambiguous, and there is no good way of choosing.
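
The procedure in the question (run K-means from many random initializations, then keep the clustering with the lowest distortion) can be sketched as follows. The data is synthetic, and the function names, the fixed iteration count, and the handling of empty clusters are assumptions of this illustration.

```python
import random


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def assign_all(points, centroids):
    """Cluster assignment step: index of the closest centroid for each point."""
    k = len(centroids)
    return [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]


def kmeans(points, k, rng, iterations=50):
    """Run K-means once; return (centroids, assignments, distortion)."""
    # Initialize the centroids to k randomly chosen training examples.
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        assign = assign_all(points, centroids)
        # Move-centroid step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, c in zip(points, assign) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    assign = assign_all(points, centroids)
    distortion = sum(sq_dist(p, centroids[c])
                     for p, c in zip(points, assign)) / len(points)
    return centroids, assign, distortion


rng = random.Random(0)
# Made-up 2-D data: three blobs of 20 points each.
points = [(cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
          for cx, cy in [(0, 0), (5, 0), (0, 5)] for _ in range(20)]

runs = [kmeans(points, k=3, rng=rng) for _ in range(50)]
best = min(runs, key=lambda run: run[2])  # lowest distortion wins
print(f"best distortion over {len(runs)} runs: {best[2]:.3f}")
```

The distortion computed here is $\frac{1}{m} \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2$, the same quantity named in the answer options.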


5. Which of the following statements are true? Select all that apply.

Once an example has been assigned to a particular centroid, it will never be reassigned
to another different centroid.

A good way to initialize K-means is to select K (distinct) examples from the training set
and set the cluster centroids equal to these selected examples.

On every iteration of K-means, the cost function $J(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K)$ (the distortion
function) should either stay the same or decrease; in particular, it should not increase.

K-Means will always give the same results regardless of the initialization of the
centroids.

The standard way of initializing K-means is setting $\mu_1 = \cdots = \mu_K$ to be equal to a vector of
zeros.

Since K-Means is an unsupervised learning algorithm, it cannot overfit the data, and
thus it is always better to have as large a number of clusters as is computationally feasible.

For some datasets, the "right" or "correct" value of K (the number of clusters) can be
ambiguous, and hard even for a human expert looking carefully at the data to decide.

If we are worried about K-means getting stuck in bad local optima, one way to
ameliorate (reduce) this problem is if we try using multiple random initializations.


